Parallel Performance – Basic Concepts
Module developed Spring 2013 by Wuxu Peng
This module was created with support from NSF under grant # DUE
Improving Computer Performance

Running into a Wall with the Clock

Clock Speeds Peaked in ca.

Parallelism Unifies “Non-clock” Routes
What is Parallel Computing?

Traditional serial (sequential) computing:
- Uses a single computer with a single CPU to solve a problem
- The problem is solved using a finite # of hardware-executable operations
- Operations are executed one after another, one at a time

Parallel computing:
- Uses multiple computing resources to solve a problem concurrently
- The multiple computing resources may be:
  - A single computer with multiple computing components (a component may be an entire processor, or nearly so → multicore computer)
  - Multiple computers connected by some interconnecting link (e.g., a network)
  - A combination of both
- The problem is broken into parts that can be solved in parallel (see the sketch after this list)
  - If this can’t be done to a useful extent → little use for parallel computing
  - The parallel computing challenge → develop/implement parallelizing ideas
- Each part is solved using a finite # of hardware-executable operations
- Operations for the parts are executed simultaneously, using the multiple computing resources
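A minimal Python sketch of this contrast (not from the original slides; the summation task, the four-way chunking, and the process pool are illustrative choices): the same problem solved serially with one stream of operations, then broken into parts that worker processes solve concurrently.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    total = 0
    for x in chunk:        # a finite # of hardware-executable operations,
        total += x         # executed one after another within this part
    return total

if __name__ == "__main__":
    data = list(range(1_000_000))

    # Serial: one CPU, one operation at a time.
    serial_result = partial_sum(data)

    # Parallel: break the problem into parts that can be solved in parallel,
    # solve them concurrently, then combine the partial results.
    n_parts = 4
    chunks = [data[i::n_parts] for i in range(n_parts)]
    with ProcessPoolExecutor(max_workers=n_parts) as pool:
        parallel_result = sum(pool.map(partial_sum, chunks))

    assert serial_result == parallel_result
```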
Parallelism Unifies “Non-clock” Routes (cont.)
Parallel Computing Has Its Limit

Parallel computing (recap):
- The problem is broken into parts that can be solved in parallel
- If this can’t be done to a useful extent → little use for parallel computing
- The parallel computing challenge → develop/implement parallelizing ideas

Amdahl’s law:
- Gives a commonsense ceiling on the achievable performance gain
- The speedup a better way can give → limited by the usability of the better way
- Quantitatively:

    Speedup = 1 / ( %Unusable + %Usable / N )

  where N is the factor by which the “usable” portion is sped up (e.g., the # of processors)

Memory aid: %Unusable = (1 - %Usable)
Note: Speedup is in terms of “# of times as fast”.
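A small calculator for this bound (an illustrative sketch, not the slides’ own material; the function name and the example percentages are made up):

```python
def amdahl_speedup(usable_fraction, n):
    """Overall speedup when the 'usable' fraction of the work is sped up n times."""
    unusable = 1.0 - usable_fraction              # %Unusable = (1 - %Usable)
    return 1.0 / (unusable + usable_fraction / n)

print(amdahl_speedup(0.90, 8))      # ~4.7x: 8-way parallelism, 90% usable
print(amdahl_speedup(0.90, 10**9))  # ~10x: the ceiling, no matter how many processors
```

Note how the second call makes the “ceiling” concrete: with 10% of the work unusable, no number of processors can push the speedup past 10x.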
N.B.

Trends of the past many years (manifested in ever-faster networks, distributed systems, and multiprocessor computers) point to “the future of computing lies in parallelism”.
- Current supercomputers are all parallel computers.
- Parallel computers entered mainstream computing with the advent of multicore computers.
- Without efficient parallel algorithms, a parallel computer is like Austin Powers losing his mojo!

It’s harder to write parallel programs than sequential ones:
- Work must be evenly distributed over the processors.
- The amount of inter-processor communication must be curbed.
- If not programmed in a portable fashion, a program may run fast on certain architectures but surprisingly slowly on others.

What would it be like if there were no parallel/concurrent activities? (A less mind-boggling warm-up: what would it be like if there were no gravity?)
Coaxing Mr. Quick & Mr. Parallel to Work Wonders

Trends of the past many years point to “the future of computing lies in parallelism” … Without efficient parallel algorithms, a parallel computer is like Austin Powers losing his mojo!

How safe is quicksort’s mojo in the world of parallel computing?
- Divide-and-conquer nature → feels right at home
- Individual in-place partitioning → doesn’t seem to sit well with Amdahl (where the parallelizing challenge lies?)
- Trails to parallelizing quicksort → rather well-trodden
- Only a selective sampling of these for our study tour (a sequential baseline is sketched below):
  - A rather naïve algorithm for shared-memory processors
  - Algorithm 1 for distributed-memory processors
  - Algorithm 2 for distributed-memory processors
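As a reference point for the algorithms that follow, here is a standard sequential quicksort in Python (a textbook version, not code from the module), showing both traits named above: the parallel-friendly divide-and-conquer recursion and the sequential in-place partitioning pass.

```python
def partition(a, lo, hi):
    """In-place Lomuto partition; returns the pivot's final index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = partition(a, lo, hi)
        quicksort(a, lo, p - 1)   # the two recursive calls are independent:
        quicksort(a, p + 1, hi)   # natural candidates for parallel execution

data = [5, 3, 8, 1, 9, 2]
quicksort(data)
assert data == [1, 2, 3, 5, 8, 9]
```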
A Rather Naïve Algorithm for Shared-Memory Processors

Quick take on the shared-memory architecture:
- Multiple processors access a common pool of memory via one shared bus
- Higher risk of running into bus contention
- Multiple processes (one per processor) execute in parallel

The algorithm (sketched in code below):
- Items to be sorted → stored in a global array
- A global stack stores the indices of (generally unsorted) sub-arrays
- Each idling process checks whether the stack is nonempty and, if so:
  - Pops the indices of a sub-array
  - Done (returns) if the size of the sub-array is 1 (already sorted); otherwise…
  - …partitions the sub-array into 2, and…
  - Pushes the indices of one partitioned (generally unsorted) sub-array onto the stack
  - Applies the same treatment to the other (generally unsorted) sub-array
- (Note: each process locks/unlocks the stack before/after pushing or popping.)
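A minimal runnable sketch of this scheme (assumptions, not from the slides: Python threads stand in for processors, Lomuto partitioning, and a remaining-items counter for termination; CPython’s GIL means this illustrates the bookkeeping, not real speedup):

```python
import threading

data = [9, 4, 7, 1, 8, 2, 6, 3, 5, 0]  # global array of items to sort
stack = [(0, len(data) - 1)]           # global stack of sub-array indices
lock = threading.Lock()                # guards every push and pop
remaining = len(data)                  # items not yet in their final position

def partition(a, lo, hi):
    """In-place Lomuto partition; the pivot lands in its final position."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def worker():
    global remaining
    while True:
        with lock:                            # lock stack before popping
            if remaining == 0:
                return                        # every item placed: done
            task = stack.pop() if stack else None
        if task is None:
            continue                          # idle: stack momentarily empty
        lo, hi = task
        while lo < hi:
            p = partition(data, lo, hi)       # pivot now finally placed
            with lock:                        # lock stack before pushing
                remaining -= 1
                stack.append((p + 1, hi))     # push one half for idle workers...
            hi = p - 1                        # ...and keep the other half
        with lock:
            remaining -= max(hi - lo + 1, 0)  # a size-0/1 piece is sorted

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert data == sorted([9, 4, 7, 1, 8, 2, 6, 3, 5, 0])
```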
A Rather Naïve Algorithm for Shared-Memory Processors (cont.)

Can achieve only rather mediocre speedup:
- Analytical estimates for 1023 data items: with 8 processors, speedup less than 4; with 16 processors, less than 6
- Observed speedups on an actual machine: worse still

The most important reason: a low amount of parallelism early in the algorithm’s execution
- Partitioning the original unsorted array → a significant sequential component
- Limits the achievable speedup (Amdahl’s law), regardless of the # of processors
- Cold start → low or no parallelism early on, high parallelism near the end

Another factor: the shared global stack causes contention among processors
Definition of “Sorted” for Distributed-Memory Processors

With the data distributed across processors, “sorted” means two conditions hold:
- Condition 1: the items held by each processor are in sorted order
- Condition 2: every item held by processor i ≤ every item held by processor i+1 (so concatenating the processors’ lists in processor order yields the fully sorted sequence)
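A small checker for this two-part definition (illustrative code, not from the module; `local_lists[i]` stands for the items controlled by processor i):

```python
def is_sorted_distributed(local_lists):
    """True iff the distributed data meets both 'sortedness' conditions."""
    # Condition 1: every processor's local items are in sorted order.
    cond1 = all(lst == sorted(lst) for lst in local_lists)
    # Condition 2: every item on processor i <= every item on processor i+1.
    nonempty = [lst for lst in local_lists if lst]
    cond2 = all(max(a) <= min(b) for a, b in zip(nonempty, nonempty[1:]))
    return cond1 and cond2

assert is_sorted_distributed([[1, 2], [3, 4], [5]])
assert not is_sorted_distributed([[1, 4], [2, 3], [5]])  # violates condition 2
```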
Algorithm 1 for Distributed-Memory Processors

Observations:
- Execution time is dictated by when the last processor completes
- Likely to do a poor job of load balancing the number of elements each processor must sort
- Low likelihood for the pivot to be the true median, or near it (recall: good quicksort performance is obtained when the median is used as the pivot)

Improvements? (see the simulation below)
- Better load balancing? (have each processor sort an evenly distributed # of items)
- A better pivot? (better likelihood of being the true median, or something closer to it)
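A quick simulation of the pivot/imbalance observation (illustrative, not from the module; the data sizes and trial count are arbitrary choices): a pivot taken as an arbitrary element of the unsorted data often lands far from the true median, so the two sides of the split — and the processors that receive them — can be badly uneven.

```python
import random

random.seed(1)
n, trials = 2048, 500
worst = 0.0
for _ in range(trials):
    items = [random.random() for _ in range(n)]
    pivot = items[0]                           # arbitrary element as pivot
    low = sum(1 for x in items if x <= pivot)  # items sent to the "low" side
    worst = max(worst, abs(low - n / 2) / n)
print(f"worst split imbalance over {trials} trials: {worst:.1%} away from 50/50")
```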
Algorithm 2 for Distributed-Memory Processors

Observations:
- Starts where Algorithm 1 ends: each processor sorts the items it controls
- When done → “sortedness” condition 1 is met
- Processes must still exchange items to meet “sortedness” condition 2
- A process can use the median of its sorted items as the pivot
  - This is much more likely to be close to the true median (see the experiment below)
- An even better pivot: the median of medians?
  - Empirically found to be unattractive: it incurs extra communication costs → offsets any computational gains
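A quick experiment backing this claim (illustrative, not from the module; the data distribution and processor count are arbitrary choices): the median of one processor’s locally sorted share lands far closer to the true median than an arbitrary element does.

```python
import random
import statistics

random.seed(2)
n, p = 8192, 8                         # n items spread over p processors
items = [random.random() for _ in range(n)]
true_median = statistics.median(items)

local = sorted(items[: n // p])        # one processor's locally sorted share
median_pivot = local[len(local) // 2]  # Algorithm 2's pivot choice
naive_pivot = items[0]                 # an arbitrary-element pivot

print(abs(median_pivot - true_median))  # small: close to the true median
print(abs(naive_pivot - true_median))   # typically much larger
```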
Acknowledgments

A major part of these notes is derived from the material of others, in one form or another — especially material that has been made available online. In the spirit of parallelism, gratitude is hereby conveyed in parallel to each and every one whose material has been exploited, in one way or another, for educational purposes. It may well serve as a closing example of “parallelism exploitation”!