Parallel Performance (6/16/2010)



Performance Considerations
I parallelized my code and … the code is slower than the sequential version, or I don't get enough speedup. What should I do?

Performance Considerations
Possible causes:
- Algorithmic bottlenecks
- Fine-grained parallelism
- True contention
- False contention
- Other bottlenecks

Algorithmic Bottlenecks
Parallel algorithms can be completely different from their sequential counterparts, and might need a different design and implementation.
Example: MergeSort (once again)

Recall from Previous Lecture
Merge was the bottleneck.
[Diagram: merge-sort recursion tree illustrating the merge step as the bottleneck.]

Most Efficient Sequential Merge Is Not Parallelizable

    void Merge(int* a, int* aEnd, int* b, int* bEnd, int* result) {
        // Each iteration consumes one element and depends on where the
        // previous iteration left off, so the loop cannot be split up.
        // (The slide elided the end condition; end pointers complete it.)
        while (a < aEnd && b < bEnd) {
            if (*a <= *b) {
                *result++ = *a++;
            } else {
                *result++ = *b++;
            }
        }
        while (a < aEnd) *result++ = *a++;
        while (b < bEnd) *result++ = *b++;
    }

Parallel Merge Algorithm
Merge two sorted arrays A and B using divide and conquer (a C# sketch follows below):
Let A be the larger array (else swap A and B), and let n be the size of A.
Split A into A[0…n/2-1], A[n/2], A[n/2+1…n-1].
Do a binary search to find the smallest m such that B[m] >= A[n/2].
Split B into B[0…m-1] and B[m…].
Return Merge(A[0…n/2-1], B[0…m-1]), A[n/2], Merge(A[n/2+1…n-1], B[m…]); the two sub-merges are independent and can run in parallel.
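A minimal C# sketch of this algorithm (an illustration, not the assignment's required solution; the names ParallelMerge and SequentialMerge and the 4096-element cutoff are assumptions):

    using System;
    using System.Threading.Tasks;

    static class MergeSketch
    {
        // Merges a[aLo..aHi) and b[bLo..bHi) into result starting at rLo.
        static void ParallelMerge(int[] a, int aLo, int aHi,
                                  int[] b, int bLo, int bHi,
                                  int[] result, int rLo)
        {
            int aLen = aHi - aLo, bLen = bHi - bLo;
            if (aLen < bLen)
            {   // Ensure A is the larger array.
                ParallelMerge(b, bLo, bHi, a, aLo, aHi, result, rLo);
                return;
            }
            if (aLen == 0) return;  // both ranges empty
            if (aLen + bLen <= 4096)
            {   // Assumed cutoff: merge small ranges sequentially.
                SequentialMerge(a, aLo, aHi, b, bLo, bHi, result, rLo);
                return;
            }
            int aMid = (aLo + aHi) / 2;
            // Smallest m in [bLo, bHi) with b[m] >= a[aMid]. Note: with
            // duplicate keys Array.BinarySearch may not return the smallest
            // match; a hand-written lower-bound search would fix that.
            int m = Array.BinarySearch(b, bLo, bLen, a[aMid]);
            if (m < 0) m = ~m;
            int pivot = rLo + (aMid - aLo) + (m - bLo);
            result[pivot] = a[aMid];
            Parallel.Invoke(  // the two halves are independent
                () => ParallelMerge(a, aLo, aMid, b, bLo, m, result, rLo),
                () => ParallelMerge(a, aMid + 1, aHi, b, m, bHi, result, pivot + 1));
        }

        static void SequentialMerge(int[] a, int aLo, int aHi,
                                    int[] b, int bLo, int bHi,
                                    int[] result, int rLo)
        {
            while (aLo < aHi && bLo < bHi)
                result[rLo++] = (a[aLo] <= b[bLo]) ? a[aLo++] : b[bLo++];
            while (aLo < aHi) result[rLo++] = a[aLo++];
            while (bLo < bHi) result[rLo++] = b[bLo++];
        }
    }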

Assignment 1 Extra Credit
Implement the parallel merge algorithm and measure the performance improvement with your work-stealing queue implementation.

Fine-Grained Parallelism
Overheads of tasks:
- Each task uses some memory (though not as many resources as a thread)
- Work-stealing queue operations
If the work done in each task is small, the overheads might not justify the improvement in parallelism. A common remedy is a sequential cutoff, sketched below.
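A hypothetical illustration of that remedy (the ParallelSum name and the 8192-element cutoff are assumptions): below some grain size, do the work inline instead of spawning more tasks.

    using System.Threading.Tasks;

    static long ParallelSum(int[] data, int lo, int hi)
    {
        const int Cutoff = 8192;          // assumed grain size
        if (hi - lo <= Cutoff)
        {   // Small range: compute inline, paying no task overhead.
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;
        long left = 0, right = 0;
        Parallel.Invoke(                  // fork two subtasks and wait
            () => left = ParallelSum(data, lo, mid),
            () => right = ParallelSum(data, mid, hi));
        return left + right;
    }

Tuning the cutoff trades parallelism against per-task overhead: too small and the work-stealing queue dominates, too large and cores sit idle.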

False Sharing: Data Locality & Cache Behavior
The performance of a computation depends hugely on how well the cache is working. You get too many cache misses if processors are "fighting" over the same cache lines, even if they don't access the same data.

Cache Coherence
[Diagram: three processors' caches, with each cached line marked i, s, or x.]
Each cache line, on each processor, has one of these states:
- i (invalid): not cached here
- s (shared): cached, but immutable
- x (exclusive): cached, and can be read or written
State transitions require communication between caches (the cache coherence protocol). If a processor writes to a line, it removes the line from all other caches.
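A short worked trace of these rules (a hypothetical two-processor example): P1 reads line L, so P1's copy goes from i to s. P2 reads L, and both copies are now s. P2 then writes L: P2's copy goes to x, and the write removes P1's copy, s to i. When P1 next reads L it misses, and the protocol must fetch the line back, downgrading P2's copy to s.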

Ping-Pong & False Sharing
Ping-pong: if two processors both keep writing to the same location, the cache line has to go back and forth between their caches. This is very inefficient (lots of cache misses).
False sharing: two processors writing to two different variables may happen to write to the same cache line, if both variables are allocated on that line. You get the same ping-pong effect as above, and horrible performance.
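One common mitigation, sketched here under stated assumptions (the PaddedCounter name is made up, and 64 bytes is only a typical cache-line size): pad per-thread data so each piece occupies its own cache line.

    using System.Runtime.InteropServices;

    // Hypothetical sketch: force each counter onto its own 64-byte slot so
    // two threads incrementing neighboring counters don't share a line.
    [StructLayout(LayoutKind.Explicit, Size = 64)]
    struct PaddedCounter
    {
        [FieldOffset(0)] public long Value;  // remaining 56 bytes are padding
    }

    // Usage: counters[threadId].Value++ touches a distinct cache line per
    // thread, assuming the array itself starts 64-byte aligned.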

False Sharing Example

    void WithFalseSharing()
    {
        Random rand1 = new Random(), rand2 = new Random();
        int[] results1 = new int[20000000],
              results2 = new int[20000000];
        Parallel.Invoke(
            () => {
                for (int i = 0; i < results1.Length; i++)
                    results1[i] = rand1.Next();
            },
            () => {
                for (int i = 0; i < results2.Length; i++)
                    results2[i] = rand2.Next();
            });
    }

False Sharing Example, Annotated
rand1 and rand2 are allocated at the same time, so they likely end up on the same cache line. Every call to Next() writes to the Random object's internal state, so the two loops keep writing to that shared line concurrently: the ping-pong effect.

False Sharing, Eliminated?
rand1 and rand2 are now allocated by different tasks, so they are not likely to be on the same cache line.

    void WithoutFalseSharing()
    {
        int[] results1, results2;
        Parallel.Invoke(
            () => {
                Random rand1 = new Random();
                results1 = new int[20000000];
                for (int i = 0; i < results1.Length; i++)
                    results1[i] = rand1.Next();
            },
            () => {
                Random rand2 = new Random();
                results2 = new int[20000000];
                for (int i = 0; i < results2.Length; i++)
                    results2[i] = rand2.Next();
            });
    }
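To check whether the change actually helps, a minimal timing harness (a sketch; the Time helper is an assumption, while the method names come from the slides):

    using System;
    using System.Diagnostics;

    static void Time(string label, Action action)
    {
        // Run once and report wall-clock time; a real measurement should
        // warm up first and average several runs.
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        Console.WriteLine("{0}: {1} ms", label, sw.ElapsedMilliseconds);
    }

    // Usage:
    // Time("WithFalseSharing", WithFalseSharing);
    // Time("WithoutFalseSharing", WithoutFalseSharing);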