Parallel Processing: A Perspective on Hardware and Software


Parallel Processing: A Perspective on Hardware and Software

Introduction to Parallel Processing. Based on "Efficient Linked List Ranking Algorithms and Parentheses Matching as a New Strategy for Parallel Algorithm Design", R. Halverson. Chapter 1 – Introduction to Parallel Processing.

Parallel Processing Research. 1980s: a great deal of research and publications. 1990s: the hardware was not very successful, so the research area "dies". Why? Early 2000s: a resurgence begins. Why? Will it continue to be successful this time?

Goal of PP. Why bother with parallel processing? Goal: solve problems faster! In reality: faster, but also efficient. Work-optimal: the parallel algorithm runs faster than the sequential algorithm in proportion to the number of processors used. Sometimes work-optimality is seemingly not possible.

PP Issues. Processors: number, connectivity, communication. Memory: shared vs. local. Data structures. Data distribution. Problem-solving strategies.

Parallel Problems. One approach: try to develop a parallel solution to a problem without consideration of the hardware. Then apply the solution to the specific hardware and determine the extra cost, if any. If it is not acceptably efficient, try again!

Parallel Problems. Another approach: armed with knowledge of the strategies, data structures, etc. that work well for a particular hardware, develop a solution with that specific hardware in mind. Third approach: modify a solution designed for one hardware configuration to fit another.

Real World Problems. Inherently parallel: the nature or structure of the problem lends itself to parallelism. Examples: mowing a lawn, cleaning a house, grading papers. Such problems are easily divided into sub-problems with very little overhead.

Real World Problems. Not inherently parallel: parallelism is possible but more complex to define, or comes with (excessive) overhead cost. Examples: balancing a checkbook, giving a haircut, wallpapering a room.

Some Computer Problems. Are these "inherently parallel" or not? Processing customers' monthly bills. Payroll checks. Building student grade reports from class grade sheets. Searching for an item in a linked list. A video game program. Searching a state driver's license database. Is the issue the hardware, the software, or the data? What assumptions are made?

General Observations. What characteristics make a problem inherently parallel? What characteristics make a problem difficult to parallelize? Consider hardware, software, and data structures.

Payroll Problem. Consider 10 PCs, with each employee's information stored in a row of an array A. Label the processors P0, P1, …, P9. A[100] has rows indexed 0 to 99. For i = 0 to 9, Pi processes A[i*10] through A[((i+1)*10)-1].

Code for Payroll. For i = 0 to 9, Pi processes A[i*10] through A[((i+1)*10)-1]. Each PC runs one such process in parallel:
For each Pi, i = 0 to 9 do   // a separate process on each PC
    For j = 0 to 9
        Process A[i*10 + j]
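A minimal runnable sketch of this block distribution, assuming Python's multiprocessing Pool; the record contents and the gross-pay computation are made-up placeholders, not part of the slides:
from multiprocessing import Pool
P = 10                     # number of PCs (processors)
N = 100                    # number of payroll records
# Hypothetical records; in the slides each row of A holds one employee's information.
A = [{"hours": 40, "rate": 25.0} for _ in range(N)]
def process_block(i):
    # Processor Pi handles A[i*(N//P)] through A[((i+1)*(N//P))-1] independently.
    lo, hi = i * (N // P), (i + 1) * (N // P)
    return [rec["hours"] * rec["rate"] for rec in A[lo:hi]]   # e.g., gross pay per record
if __name__ == "__main__":
    with Pool(P) as pool:
        pay = pool.map(process_block, range(P))   # the 10 blocks run in parallel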

Time Complexity of Payroll Algorithm. Consider P processors and N data items. Each PC has N/P data items. Assume the data is accessible and writable by each PC. Time: O(N/P).

Payroll Questions. Now that we have a solution, it must be applied to hardware. Which hardware? Main question: where is the array, and how is it accessed by each processor? One shared memory or many local memories? Where are the results placed?

What about I/O?? Generally, in parallel algorithms, I/O is disregarded. Assumption: Data is stored in the available memory. Assumption: Results are written back to memory. Data input and output are generally independent of the processing algorithm.

Balancing a Checkbook. Consider the same hardware and data array. We can still distribute and process in the same manner as the payroll: each block computes deposits as additions and checks as subtractions, and totals its block (10 subtotals). BUT the 10 subtotals must then be combined into the final total. This is the overhead.

Complexity of Checkbook. Consider P processors and N data items. Each PC has N/P data items. Assume the data is accessible and writable. Time for each block: O(N/P). Combining the P subtotals takes O(P) sequentially, or O(log P) with a tree-style combine. Total: O(N/P + P) down to O(N/P + log P).
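A rough sequential sketch of this two-phase pattern (local subtotals, then a logarithmic tree combine); the transaction values below are made up for illustration:
import math
P = 8
transactions = [12.5, -3.0, 40.0, -7.25] * 25   # hypothetical deposits (+) and checks (-)
N = len(transactions)
# Phase 1: each of the P "processors" totals its own block of about N/P items -> O(N/P) each.
block = math.ceil(N / P)
subtotals = [sum(transactions[i*block:(i+1)*block]) for i in range(P)]
# Phase 2: pairwise tree combine -> O(log P) rounds; additions within a round are independent.
step = 1
while step < P:
    for i in range(0, P, 2 * step):
        if i + step < P:
            subtotals[i] += subtotals[i + step]
    step *= 2
total = subtotals[0]     # equals sum(transactions)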

Performance. Complexity of a perfect parallel algorithm: if the best sequential algorithm for a problem is O(f(x)), then the perfect parallel algorithm would be O(f(x)/P). This happens when there is little or no overhead. Actual run time: typically it takes about 4 processors to halve the actual run time.

Performance Measures. Run time alone is not a practical measurement. Assume T1 and Tp are the run times using 1 and p processors, respectively. Speedup: S = T1/Tp. Work: W = p * Tp (aka cost). If W = O(T1), then the algorithm is work (cost) optimal and achieves linear speedup.
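As an illustration with made-up numbers (not from the slides): if T1 = 1000 time units and T10 = 120 on p = 10 processors, then S = 1000/120, roughly 8.3, and W = 10 * 120 = 1200. Work-optimality is an asymptotic property, so it is judged from the formulas for T1 and Tp rather than from single measurements.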

Scalability An algorithm is said to be Scalable if performance increases linearly with the number of processors Implication: Algorithm sustains good performance over a wide range of processors.

Scalability What about continuing to add processors? At what point does adding more processors stop improving the run time? Does adding processors ever cause the algorithm to take more time? What is the optimal number of processors? Consider W = p * Tp = O(T1) Solve for p
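As a rough worked example (a back-of-the-envelope reading, not a statement from the slides), take the checkbook bounds above: Tp = O(N/p + log p), so W = p * Tp = O(N + p log p). Requiring W = O(T1) = O(N) means p log p must be O(N), so on the order of N / log N processors can be used before the algorithm stops being work-optimal.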

Models of Computation. Two major categories: shared memory (e.g., the PRAM) and fixed connection (e.g., the hypercube). There are numerous versions of each. Not all are fully realizable in hardware.

Sidenote: Models. Distributed computing: the use of 2 or more separate computers to solve a single problem; a version of a network; clusters. This is not really a topic for this course.

Shared Memory Model. PRAM – Parallel Random Access Machine. A category with 4 variants: EREW, CREW, ERCW, CRCW. All communication is through a shared global memory. Each PC has a small local memory.

Variants of PRAM: EREW, CREW, ERCW, CRCW. Concurrent read: 2 or more processors may read the same (or different) memory locations simultaneously. Exclusive read: 2 or more processors may access global memory only if each is accessing a unique address. Concurrent and exclusive write are defined similarly.

Shared Memory Model. [Diagram: processors P0, P1, P2, P3 all connected to a shared global memory.]

Shared Memory What are some implications of the variants in memory access of the PRAM model? What is the strongest model?

Fixed Connection Models. Each PC contains a local memory (distributed memory). PCs are connected through some type of interconnection network, and the interconnection network defines the model. Communication is via message passing and can be synchronous or asynchronous.

Interconnection Networks Bus Network (Linear) Ring Mesh Torus Hypercube

Hypercube Model. A distributed-memory, message-passing, fixed-connection parallel computer. N = 2^r nodes and E = r * 2^(r-1) edges. Nodes are numbered 0 to N-1 in binary such that any 2 nodes whose labels differ in exactly one bit are connected by an edge. The dimension is r.
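A small sketch that builds the edge set directly from this definition (two nodes are adjacent exactly when their binary labels differ in one bit); the dimension r = 3 is just an example value:
r = 3                               # dimension (example value)
N = 2 ** r                          # number of nodes
edges = [(u, u ^ (1 << b))          # flipping bit b of u gives one of u's neighbors
         for u in range(N)
         for b in range(r)
         if u < (u ^ (1 << b))]     # keep each undirected edge only once
assert len(edges) == r * 2 ** (r - 1)    # E = r * 2^(r-1); here 12 edges for N = 8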

Hypercube Examples: N = 2 (dimension 1, nodes 0 and 1) and N = 4 (dimension 2, nodes 00, 01, 10, 11). [Diagrams of the 1- and 2-dimensional hypercubes.]

Hypercube Example: N = 8 (dimension 3, nodes 000 through 111). [Diagram of the 3-dimensional hypercube.]

Hypercube Considerations. Message-passing communication: possible delays. Load balancing: each PC should have the same workload. Data distribution: must follow the connections.

Consider the Checkbook Problem. How about the distribution of data? Often the initial distribution is disregarded. What about the combination of the subtotals? Reduction is done dimension by dimension.

Design Strategies Paradigm: a general strategy used to aid in the development of the solution to a problem

Paradigms Extended from Sequential Use Divide-and-Conquer Branch-and-Bound Dynamic Programming

Paradigms Developed for Parallel Use Deterministic coin tossing Symmetry breaking Accelerating cascades Tree contraction Euler Tours Linked List Ranking All Nearest Smaller Values (ANSV) Parentheses Matching

Divide-and-Conquer Most basic parallel strategy Used in virtually every parallel algorithm Problem is divided into several sub-problems that can be solved independently; results of sub-problems are combined into the final solution Example: Checkbook Problem

Dynamic Programming. A divide-and-conquer technique used when the sub-problems are not independent, i.e., they share common sub-problems. Sub-problem solutions are stored in a table for use by other processes. Often used for optimization problems (minimum or maximum). Example: Fibonacci numbers.
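A minimal sketch of the Fibonacci example named on the slide, with shared sub-problem results kept in a table (shown sequentially; the caching decorator stands in for the table):
from functools import lru_cache
@lru_cache(maxsize=None)
def fib(n):
    # Each sub-problem result is cached, so shared sub-problems are computed only once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)
print(fib(30))    # 832040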

Branch-and-Bound Breadth-first tree processing technique Uses a bounding function that allows some branches of the tree to be pruned (i.e. eliminated) Example: Game programming
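For concreteness, a tiny branch-and-bound sketch on a 0/1 knapsack instance; the data, the fractional-relaxation bound, and the depth-first search order are all illustrative choices, not from the slides. The point is the bounding function that lets whole branches of the search tree be pruned:
values  = [60, 100, 120]            # made-up items, listed in decreasing value/weight ratio
weights = [10, 20, 30]
capacity = 50
n = len(values)
best = 0
def bound(i, weight, value):
    # Optimistic estimate: greedily fill the remaining capacity, taking a fraction
    # of the first item that does not fit (valid because items are in ratio order).
    remaining, b = capacity - weight, value
    for j in range(i, n):
        if weights[j] <= remaining:
            remaining -= weights[j]
            b += values[j]
        else:
            b += values[j] * remaining / weights[j]
            break
    return b
def search(i, weight, value):
    global best
    if weight > capacity:
        return
    best = max(best, value)
    if i == n or bound(i, weight, value) <= best:
        return                      # prune: this branch cannot beat the best found so far
    search(i + 1, weight + weights[i], value + values[i])   # take item i
    search(i + 1, weight, value)                            # skip item i
search(0, 0, 0)
print(best)                         # 220 for this data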

Symmetry Breaking. A strategy that breaks a linked structure (e.g., a linked list) into disjoint pieces for processing. Deterministic coin tossing: using the binary representation of each element's index, nonadjacent elements are selected for processing. Often used in linked list ranking algorithms.

Accelerated Cascades. Applying 2 or more algorithms to a single problem, changing from one to the other based on the ratio of the problem size to the number of processors (a threshold). This "fine tuning" sometimes allows for better performance.

Tree Contraction. Nodes of a tree are removed, and the information removed is combined with that of the remaining nodes. Multiple processors are assigned to independent nodes. The tree is reduced to a single node, which contains the solution. Example: arithmetic expression evaluation.

Euler Tour. Duplicate each edge of a tree or graph with an edge in the opposite direction to create a circuit. This allows the tree or graph to be processed as a linked list.
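A compact sketch of this construction on a small made-up tree: each undirected edge becomes two opposite arcs, and each arc (u, v) is followed by the arc leaving v toward the neighbor that comes after u in v's adjacency list, which links all the arcs into one circuit:
edges = [(1, 2), (1, 3), (2, 4)]          # hypothetical tree: edges 1-2, 1-3, 2-4
adj = {}
for u, v in edges:                        # duplicate each edge as two opposite arcs
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)
succ = {}                                 # successor of arc (u, v) is (v, w), where w is
for v, nbrs in adj.items():               # the neighbor after u in v's cyclic adjacency list
    for i, u in enumerate(nbrs):
        succ[(u, v)] = (v, nbrs[(i + 1) % len(nbrs)])
start = (1, 2)                            # following succ visits all 2*(n-1) arcs exactly once
tour, arc = [start], succ[start]
while arc != start:
    tour.append(arc)
    arc = succ[arc]
print(tour)                               # an Euler circuit over the 6 arcs of this 4-node tree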

Linked List Ranking. Halverson's area of dissertation research. A technique to number, in order, the elements of a linked list (20+). Applied to a wide range of problems (23): Euler tours, tree traversals, tree searches, spanning trees and forests, list packing, connected components, connectivity, graph decomposition.
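List ranking is commonly illustrated with the standard pointer-jumping technique; below is a sequential simulation of that generic idea (not Halverson's algorithms from the dissertation), computing each element's distance from the tail in O(log n) rounds:
nxt  = [1, 2, 3, 4, None]        # made-up list: successor index of each element; None marks the tail
rank = [0 if n is None else 1 for n in nxt]     # initial distance-to-tail estimates
changed = True
while changed:                   # one iteration corresponds to one "parallel" round
    changed = False
    new_nxt, new_rank = nxt[:], rank[:]
    for i in range(len(nxt)):    # every element updates from the old copies, as on a PRAM
        if nxt[i] is not None:
            new_rank[i] = rank[i] + rank[nxt[i]]
            new_nxt[i]  = nxt[nxt[i]]
            changed = True
    nxt, rank = new_nxt, new_rank
print(rank)                      # [4, 3, 2, 1, 0]: each element's rank counted from the tail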

All Nearest Smaller Values (ANSV). For each value x, find the nearest elements (to its left and right) that are smaller than x. Successfully applied to: depth-first search of interval graphs, parentheses matching, line packing, triangulating a monotone polygon.

Parentheses Matching. In a properly formed string of parentheses, find the index of each parenthesis's mate. Applied to solve: heights of all nodes in a tree, extreme values in a tree, lowest common ancestor, balancing binary trees.
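A sequential stack-based reference for the problem statement (the parallel algorithms studied in the course are more involved; the input string here is made up):
s = "(()(()))"                   # hypothetical well-formed input
mate = [None] * len(s)           # mate[i] will hold the index of i's matching parenthesis
stack = []
for i, c in enumerate(s):
    if c == '(':
        stack.append(i)          # remember where the open parenthesis is
    else:
        j = stack.pop()          # the most recent unmatched '(' is this ')'s mate
        mate[i], mate[j] = j, i
print(mate)                      # [7, 2, 1, 6, 5, 4, 3, 0]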

Parallel Algorithm Design Identify problems and/or classes of problems for which a particular strategy will work Apply to the appropriate hardware Most of the paradigms have been optimized for a variety of parallel architectures

Broadcast Operation. Not a paradigm, but an operation used in many parallel algorithms: provide one or more items of data to all the processors (individual memories). Let P be the number of processors. For most models, the broadcast operation has O(log P) time complexity.

Broadcast. Shared memory (EREW): P0 writes for P1; P0 & P1 write for P2 & P3; P0 – P3 write for P4 – P7; then each PC has a copy it can read in one time unit. Hypercube: P0 sends to P1; P0 & P1 send to P2 & P3; etc. Both are O(log P).
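A sequential simulation of the doubling pattern just described, assuming the value starts only at P0 (the value 42 and P = 8 are arbitrary): in each round, every processor that already holds the value supplies it to one more processor, so about log2(P) rounds suffice:
P = 8
mem = [None] * P                 # one cell per processor's local copy
mem[0] = 42                      # the item initially held only by P0
have, rounds = 1, 0
while have < P:
    for i in range(have):        # these 'have' copies could all happen concurrently
        if i + have < P:
            mem[i + have] = mem[i]
    have *= 2
    rounds += 1
print(mem, rounds)               # every processor holds 42 after log2(8) = 3 rounds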

Remainder of this Course. Cover Chapters 1 & 2. Cover parts of Chapters 3, 4, and 5. Cover Chapter 6. Other chapters to be determined. Graduate student presentations. Videos. Exams, homework, quizzes.