CPU Efficiency Issues

Many find_paths?
- deliveryOrder was {2, 3, 4, 5, 6, 0, 7, 1}
- How many calls to find_path to figure out the new travel route and travel time? 4
- Say each takes 0.1 s → 0.4 s for one perturbation!
- 30 s CPU time limit. Yikes!

Ideas?
1. Use geometric distance to estimate travel time, and only call find_path at the end, with the final order?
- Reasonable, but will not be perfectly accurate
- Will cost us some solution quality
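
For concreteness, here is a minimal sketch of such an estimate (not the course-provided API): the straight-line distance between two lat/lon points divided by an assumed maximum driving speed gives a cheap, optimistic travel-time estimate. LatLon, the constants, and the function names are illustrative stand-ins.

#include <cmath>

struct LatLon { double lat, lon; };   // position in degrees (illustrative stand-in)

constexpr double DEGREE_TO_RADIAN = 0.017453292519943295;
constexpr double EARTH_RADIUS_M   = 6371000.0;                  // approximate mean Earth radius
constexpr double MAX_SPEED_MPS    = 100.0 * 1000.0 / 3600.0;    // assumed 100 km/h speed cap

// Equirectangular approximation of the straight-line distance in metres.
double straight_line_distance(LatLon a, LatLon b) {
    double lat_avg = DEGREE_TO_RADIAN * (a.lat + b.lat) / 2.0;
    double x1 = DEGREE_TO_RADIAN * a.lon * std::cos(lat_avg);
    double x2 = DEGREE_TO_RADIAN * b.lon * std::cos(lat_avg);
    double y1 = DEGREE_TO_RADIAN * a.lat;
    double y2 = DEGREE_TO_RADIAN * b.lat;
    return EARTH_RADIUS_M * std::hypot(x2 - x1, y2 - y1);
}

// Cheap travel-time estimate: the road path is never shorter than the straight
// line and you can never drive faster than the assumed cap, so this is an
// optimistic estimate -- hence "not perfectly accurate".
double estimated_travel_time(LatLon a, LatLon b) {
    return straight_line_distance(a, b) / MAX_SPEED_MPS;
}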

Inaccurate travel time estimate
- Time estimate: ~300 m at 60 km/h
- Real travel time: much higher

Ideas?
2. Pre-compute all the shortest paths you might need?
- Then just look up travel times during perturbations
How many shortest paths to pre-compute, using single-source to single-destination find_path?
- Any delivery location to any other delivery location: 2N * (2N - 1) → 39,800 calls for N = 100
- Plus any depot to any delivery location: M * 2N → 2,000 calls for N = 100, M = 10
- Plus any delivery location to any depot: 2N * M → 2,000 calls
- Total: ~43,800 calls to your find_path
- At 0.1 s per call → 4,380 s → too slow

Dijkstra’s Algorithm
- What do we know at this point (part-way through the search)?
- The shortest path to every intersection inside the blue circle (the explored region shown on the slide)

Dijkstra’s Algorithm
- Keep going! Instead of stopping at the first destination, continue expanding the wavefront so a single search reaches every destination you need.

Pre-Computing Travel Time Paths
Using single-source to all-destinations Dijkstra:
- Any delivery location to any other: 2N calls → 200 if N = 100
- Plus any depot to any delivery location: M calls → 10
- Plus any delivery location to any depot: 0 calls (get these from the earlier delivery-location calls)
- Total: 210 calls
Is this the minimum? No; with a small change you can achieve 201 calls.
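
A sketch of how that pre-computation might be organized (multi_dest_dijkstra is an assumed helper, not the milestone API): one single-source / all-destinations search per delivery location plus one per depot, matching the 2N + M call count above.

#include <vector>
using namespace std;

// Assumed helper: returns the travel time from 'source' to every intersection
// id listed in 'dests' (one single-source, multi-destination Dijkstra call).
vector<double> multi_dest_dijkstra(int source, const vector<int>& dests);

// delivery_times[i][j]: travel time from delivery location i to target j
// depot_times[d][i]   : travel time from depot d to delivery location i
void precompute_travel_times(const vector<int>& delivery_locations,  // 2N pickups + dropoffs
                             const vector<int>& depots,              // M depots
                             vector<vector<double>>& delivery_times,
                             vector<vector<double>>& depot_times) {
    // Destinations for the delivery-sourced searches include the depots, so the
    // delivery-to-depot times come from these same 2N calls ("0 extra calls").
    vector<int> all_targets = delivery_locations;
    all_targets.insert(all_targets.end(), depots.begin(), depots.end());

    delivery_times.clear();
    for (int src : delivery_locations)                    // 2N calls
        delivery_times.push_back(multi_dest_dijkstra(src, all_targets));

    depot_times.clear();
    for (int src : depots)                                // M calls
        depot_times.push_back(multi_dest_dijkstra(src, delivery_locations));
}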

Is This Fast Enough?
- Dijkstra’s algorithm can search the whole graph, and with multiple destinations it usually will
- O(N) items to put in the wavefront (N = number of intersections here)
- Using a heap / priority_queue: O(log N) to add or remove one item from the wavefront
- Total: O(N log N) per call → can execute in ~0.1 s
- OK!
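
A minimal sketch of the heap-based search being analyzed here (a standalone variant of the multi_dest_dijkstra helper assumed earlier, now taking an explicit graph; the graph type is an assumption, not the course API): each intersection enters the priority_queue wavefront at O(log N) cost, and the search stops once every requested destination has been finalized.

#include <vector>
#include <queue>
#include <limits>
#include <unordered_set>
using namespace std;

// graph[u] = list of (neighbour v, travel time of edge u -> v)  -- assumed representation
using Graph = vector<vector<pair<int, double>>>;

vector<double> multi_dest_dijkstra(const Graph& graph, int source,
                                   const vector<int>& dests) {
    vector<double> time_to(graph.size(), numeric_limits<double>::infinity());

    // Min-heap wavefront of (best known travel time, intersection id).
    priority_queue<pair<double, int>,
                   vector<pair<double, int>>,
                   greater<pair<double, int>>> wavefront;

    unordered_set<int> remaining(dests.begin(), dests.end());

    time_to[source] = 0.0;
    wavefront.push({0.0, source});

    while (!wavefront.empty() && !remaining.empty()) {
        auto [t, u] = wavefront.top();
        wavefront.pop();
        if (t > time_to[u]) continue;   // stale heap entry; skip it
        remaining.erase(u);             // u's shortest travel time is now final
        for (auto [v, edge_time] : graph[u]) {
            if (t + edge_time < time_to[v]) {
                time_to[v] = t + edge_time;
                wavefront.push({time_to[v], v});   // O(log N) per push
            }
        }
    }
    return time_to;   // look up time_to[d] for each destination d
}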

Managing the Time Limit
- Can get better results with more CPU time
  - Multi-start: run the algorithm again from a new starting point, keep the best result
  - Iterative improvement: keep looking for productive changes
- But there is a 30 s limit to return a solution, for every problem, regardless of city size and number of intersections
- How many starting points? How many iterative perturbations?
- Solution: check how much time has passed, and end optimization just before the time limit

Time Limit

#include <chrono>       // Time utilities
using namespace std;
#define TIME_LIMIT 30   // m4: 30 second time limit

int main() {
    // high_resolution_clock is inside namespace chrono.
    // now() is a static member function: can call it without an object.
    auto startTime = chrono::high_resolution_clock::now();
    bool timeOut = false;
    while (!timeOut) {
        myOptimizer();
        auto currentTime = chrono::high_resolution_clock::now();
        // Time difference, in seconds. This is actual elapsed (wall-clock) time,
        // no matter how many CPUs you are using.
        auto wallClock = chrono::duration_cast<chrono::duration<double>>(
                             currentTime - startTime);
        // Keep optimizing until within 10% of the time limit
        if (wallClock.count() > 0.9 * TIME_LIMIT)
            timeOut = true;
    }
    ...
}

Multithreading Why & How

Intel 8086
- First PC microprocessor (1978)
- 29,000 transistors
- 5 MHz
- ~10 clocks / instruction
- ~500,000 instructions / s

Intel Core i7 – “Skylake” (2015)
- 1.5 billion transistors
- 4.0 GHz
- ~15 clocks / instruction, but ~30 instructions in flight
- Average ~2 instructions completed / clock
- ~8 billion instructions / s
- die photo: courtesy techpowerup.com

CPU Scaling: Past & Future
1978 to 2015:
- 50,000x more transistors
- 16,000x more instructions / s
The future:
- Still get 2x transistors every 2 - 3 years
- But transistors are not getting much faster → CPU clock speed saturating
- ~30 instructions in flight; complexity & power to go beyond this climb rapidly → slow growth in instructions / cycle
Impact:
- CPU speed is no longer increasing rapidly
- But we can fit many processors (cores) on a single chip
- Using multiple cores is now important
- Multithreading: one program using multiple cores at once

A Single-Threaded Program
[diagram: a single CPU / core holds the program counter and stack pointer; memory contains the instructions (code), global variables, heap variables (new), and one stack of local variables]

A Multi-Threaded Program
[diagram: Core1 runs thread 1 and Core2 runs thread 2, each with its own program counter and stack pointer; the instructions (code), global variables, and heap variables (new) are shared by all threads, while each thread gets its own stack of local variables (Stack1, Stack2)]

Thread Basics
- Each thread has its own program counter
  - Can be executing a different function
  - Is (almost always) executing a different instruction from the other threads
- Each thread has its own stack
  - Has its own copy of local variables (all different)
- Each thread sees the same global variables
- Dynamically allocated memory is shared by all threads
  - Any thread with a pointer to it can access it

Implications
- Threads can communicate through memory (global variables, dynamically allocated memory) → fast communication!
- Must be careful that threads don’t conflict in reads/writes to the same memory
  - What if two threads update the same global variable at the same time?
  - Not clear which update wins! (illustrated in the sketch below)
- Can have more threads than CPUs → time-share the CPUs
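
An illustration (not from the slides) of exactly this conflict, using C++11 threads: two threads increment the same global, and because ++ is a read-modify-write rather than an atomic operation, updates get lost.

#include <iostream>
#include <thread>

int shared_count = 0;                // global: visible to every thread

void add_many() {
    for (int i = 0; i < 1000000; ++i)
        ++shared_count;              // data race: both threads read, modify, write
}

int main() {
    std::thread t1(add_many);
    std::thread t2(add_many);
    t1.join();
    t2.join();
    // Expected 2000000, but typically prints something smaller,
    // and the value changes from run to run.
    std::cout << shared_count << std::endl;
    return 0;
}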

Multi-Threading Libraries
At program start, one “main” thread is created. If you need more, create them with an API/library. Options:

1. OpenMP
- Compiler directives: higher-level code
- Will compile and run serially if the compiler doesn’t support OpenMP

#pragma omp parallel for
for (int i = 0; i < 10; i++) {
    ...
}

2. C++11 threads
- More control, but lower-level code

#include <thread>   // C++11 feature
int main() {
    thread myThread(start_func_for_thread);
    ...
}

3. pthreads
- Same functionality as C++11 threads, but nastier syntax
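
A completed, runnable version of the C++11 snippet above (a sketch; the function name is just a placeholder): create the thread, let it run concurrently with main, and join it before main returns.

#include <iostream>
#include <thread>
using namespace std;

void start_func_for_thread() {
    cout << "hello from the new thread" << endl;
}

int main() {
    thread myThread(start_func_for_thread);          // starts running immediately
    cout << "hello from the main thread" << endl;    // may interleave with the other output
    myThread.join();    // wait for the other thread to finish before exiting
    return 0;
}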

Convert to Use Multiple Threads

int main() {
    vector<int> a(100), b(100);
    ...
    for (int i = 0; i < a.size(); i++) {
        a[i] = a[i] + b[i];
        cout << "a[" << i << "] is " << a[i] << endl;
    }
}

Convert to Use Multiple Threads

int main() {
    // These variables are shared by all threads.
    vector<int> a(100, 50), b(100, 2);
    ...
    #pragma omp parallel for
    for (int i = 0; i < a.size(); i++) {
        // Local variables declared within the block → separate copy per thread.
        a[i] = a[i] + b[i];
        cout << "a[" << i << "] is " << a[i] << endl;
    }
}

Output?

a[0] is 52
a[1] is 52
a[2] is a[25] is 52
a[26] is 52
52
a[27] a[50] is 52
 is 52
...

- std::cout is a global variable shared by multiple threads → not “thread safe”
- Each << is a separate function call to operator<<
- Output from different threads randomly interleaves

What Did OpenMP Do?
Split the loop iterations across threads (4 to 8 threads on the UG machines). Conceptually:

// thread 1
for (int i = 0; i <= 24; i++) {
    cout << a[i] << endl;
}
// thread 2
for (int i = 25; i <= 49; i++) {
    cout << a[i] << endl;
}
// ... and so on for the remaining threads
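
One way to see the split for yourself (a sketch; compile with -fopenmp): omp_get_thread_num() and omp_get_num_threads() are standard OpenMP runtime calls that report which thread ran each iteration.

#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        // printf is used because a single call is less likely to be chopped up
        // mid-line than a chain of << calls on cout.
        std::printf("iteration %3d ran on thread %d of %d\n",
                    i, omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}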

Fixing the Output Problem

int main() {
    vector<int> a(100, 50), b(100, 2);
    ...
    #pragma omp parallel for
    for (int i = 0; i < a.size(); i++) {
        a[i] = a[i] + b[i];
        #pragma omp critical
        cout << "a[" << i << "] is " << a[i] << endl;
    }
}

- Only one thread at a time can execute the block after #pragma omp critical
- Make everything critical? → Destroys your performance (serial again)
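
A middle ground (a sketch, not from the slides): format each line into a local string first, so the critical section covers only one short cout call per iteration instead of all the formatting work.

#include <iostream>
#include <sstream>
#include <vector>
using namespace std;

int main() {
    vector<int> a(100, 50), b(100, 2);
    #pragma omp parallel for
    for (int i = 0; i < (int) a.size(); i++) {
        a[i] = a[i] + b[i];
        ostringstream line;                  // local: each iteration/thread gets its own
        line << "a[" << i << "] is " << a[i] << "\n";
        #pragma omp critical
        cout << line.str();                  // only this short call is serialized
    }
    return 0;
}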

M4: What to Multithread?
Want:
- Takes much (ideally the majority of) CPU time
- Work is independent → no or few dependencies
Candidates?
- Path finding: 2N invocations of multi-destination Dijkstra, all doing independent work; the biggest CPU time for most teams
- Multi-start of the order-optimization algorithm: independent work; large CPU time if the algorithm is complex
- Multiple perturbations of the order at once? Harder: needs a critical section to update the best solution; maybe update only periodically

How Much Speedup Can I Get?
- Path order: 40% of time. Pathfinding: 60% of time.
- Make pathfinding parallel? With 4 cores, pathfinding time drops by at most 4x
  - Total time = 0.4 + 0.6 / 4 = 0.55 of serial → 1.8x faster
- Limited by the serial code → Amdahl’s law (see the formula below)
- Can run 8 threads with reduced performance (hyperthreading) → time ≈ 0.51 of serial in this case
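
The general form of this calculation is Amdahl’s law: if a fraction p of the runtime is parallelized across s cores, the best-case speedup is

\text{speedup} \;=\; \frac{1}{(1 - p) + p/s}
\qquad\text{e.g. } p = 0.6,\; s = 4:\quad
\frac{1}{0.4 + 0.6/4} \;=\; \frac{1}{0.55} \;\approx\; 1.8\times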

Gotcha?

vector<int> reaching_edge;   // Global: for all intersections, how did I reach you?

#pragma omp parallel for
for (int i = 0; i < deliveries.size(); i++) {
    path_times = multi_dest_dijkstra(i, deliveries);
}

- Writing to a global variable from multiple threads? (reaching_edge is filled in by every call to multi_dest_dijkstra)
- Refactor the code to make it local (best); a sketch of this fix follows below
- Use omp critical only if truly necessary
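
A sketch of the “make it local” fix (names and the helper signature are placeholders, not the real API): each iteration uses its own local scratch data and writes only to its own slot of a pre-sized results vector, so no two threads touch the same memory.

#include <vector>
using namespace std;

// Assumed helper: searches from delivery location 'src' to every location in
// 'dests', using the caller-supplied 'reaching_edge' as its back-pointer storage.
vector<double> multi_dest_dijkstra(int src, const vector<int>& dests,
                                   vector<int>& reaching_edge);

vector<vector<double>> path_times;          // one row of travel times per delivery location

void compute_all_paths(const vector<int>& deliveries) {
    path_times.resize(deliveries.size());   // size once, before the parallel loop
    #pragma omp parallel for
    for (int i = 0; i < (int) deliveries.size(); i++) {
        // Local to the loop body, so every thread gets its own copy on its own stack.
        vector<int> reaching_edge;
        // Safe: each iteration writes only path_times[i], and no two threads share an i.
        path_times[i] = multi_dest_dijkstra(i, deliveries, reaching_edge);
    }
}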

Keep It Simple!
- Do not try multi-threading until you have a good serial implementation!
- It is much harder to write and debug multi-threaded code
- You do not need to multi-thread to have a good implementation
Resources:
- OpenMP Tutorial
- Debugging threads with gdb