CPU Efficiency Issues
Many find_paths?
deliveryOrder was {2, 3, 4, 5, 6, 0, 7, 1}
How many calls to find_path to figure out the new travel route and travel time after a perturbation? 4
Say each takes 0.1 s: 0.4 s for one perturbation!
30 s CPU time limit. Yikes!
Ideas?
1. Use geometric distance to estimate travel time, and only call find_path at the end, with the final order?
   Reasonable, but will not be perfectly accurate; it will cost us some solution quality.
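The straight-line estimate of idea 1 might look like the sketch below. This is only an illustration, assuming each location has projected (x, y) coordinates in metres; estimate_travel_time, Point, and the 60 km/h constant are made-up names and values, not part of the milestone API.

#include <cmath>

struct Point { double x, y; };   // assumed: positions already projected to metres

// Rough travel-time estimate in seconds: straight-line distance divided by an
// assumed average speed. Cheap enough to call inside every perturbation;
// call find_path only on the final order to get the real travel time.
double estimate_travel_time(Point a, Point b) {
    const double speed_m_per_s = 60.0 * 1000.0 / 3600.0;   // assume ~60 km/h
    double dist = std::hypot(a.x - b.x, a.y - b.y);        // metres
    return dist / speed_m_per_s;                           // seconds
}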
Inaccurate travel time estimate
Example: a straight-line estimate of ~300 m at 60 km/h gives only ~18 s; the real travel time is much higher, because the road network forces a longer route.
Ideas?
2. Pre-compute all shortest paths you might need?
Then just look up delays during perturbations.
How many shortest paths to pre-compute, using single-source to single-destination find_path?
   Any delivery location to any other delivery location: 2N * (2N - 1) = 39,800 calls for N = 100
   Plus any depot to any delivery location: M * 2N = 2,000 calls for N = 100, M = 10
   Plus any delivery location to any depot: 2N * M = 2,000 calls
Total: 43,800 calls to your find_path. At 0.1 s per call, that is 4,380 s. Too slow!
Dijkstra's Algorithm: What do we know at this point?
The shortest path to every intersection in the blue circle (the region already explored by the expanding wavefront).
Dijkstra's Algorithm: Keep Going!
Pre-Computing Travel Time Paths
Using single-source to all-destinations Dijkstra:
   Any delivery location to any other: 2N calls (200 if N = 100)
   Plus any depot to any delivery location: M calls (10)
   Plus any delivery location to any depot: 0 calls (those paths come from the earlier delivery-location calls)
Total: 210 calls
Is this the minimum? No; with a small change we can achieve 201 calls.
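One possible shape for this pre-computation is sketched below, assuming a multi_dest_dijkstra(src, dests) routine that returns the travel time from one source intersection to each destination in a single expansion; the function name, signature, and table layout are assumptions, not the course API.

#include <vector>
using namespace std;

// Assumed helper (declaration only): one single-source, multi-destination
// Dijkstra call returning the travel time from src to each element of dests.
vector<double> multi_dest_dijkstra(int src, const vector<int>& dests);

// Fill a travel-time table with 2N + M Dijkstra calls.
// Row i holds the times from source i to the destinations it was asked about.
vector<vector<double>> precompute_travel_times(const vector<int>& delivery_locs,   // 2N locations
                                               const vector<int>& depots) {        // M depots
    vector<int> all_dests = delivery_locs;    // deliveries also need times to depots,
    all_dests.insert(all_dests.end(), depots.begin(), depots.end());   // so include them

    vector<vector<double>> travel_time;
    for (int src : delivery_locs)        // 2N calls: delivery -> all deliveries and depots
        travel_time.push_back(multi_dest_dijkstra(src, all_dests));
    for (int src : depots)               // M calls: depot -> all delivery locations
        travel_time.push_back(multi_dest_dijkstra(src, delivery_locs));
    return travel_time;                  // 210 calls total for N = 100, M = 10
}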
Is This Fast Enough?
Recall the total: 210 calls.
Dijkstra's algorithm can search the whole graph, and usually will when there are multiple destinations.
O(N) items to put in the wavefront.
Using a heap / priority_queue: O(log N) to add or remove one item from the wavefront.
Total: O(N log N) per call. Can execute in ~0.1 s. OK!
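For reference, a single expansion with a heap-based wavefront might look like the sketch below. It assumes an adjacency-list graph of {to, travel_time} edges, which is not the course's actual data structure; it is here only to show where the O(log N) priority_queue operations appear.

#include <queue>
#include <vector>
#include <limits>
using namespace std;

struct Edge { int to; double travel_time; };   // assumed edge representation

// Shortest travel time from src to every reachable intersection.
vector<double> dijkstra_all(int src, const vector<vector<Edge>>& adj) {
    vector<double> best(adj.size(), numeric_limits<double>::infinity());
    // Min-heap wavefront of (time so far, intersection): O(log N) per push/pop.
    priority_queue<pair<double,int>, vector<pair<double,int>>, greater<>> wavefront;
    best[src] = 0.0;
    wavefront.push({0.0, src});
    while (!wavefront.empty()) {
        auto [time, node] = wavefront.top();
        wavefront.pop();
        if (time > best[node]) continue;        // stale entry: a shorter path was already found
        for (const Edge& e : adj[node]) {
            double new_time = time + e.travel_time;
            if (new_time < best[e.to]) {        // relax the edge
                best[e.to] = new_time;
                wavefront.push({new_time, e.to});
            }
        }
    }
    return best;
}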
Managing the Time Limit
Can get better results with more CPU time:
   Multi-start: run the algorithm again with a new starting point; keep the best result.
   Iterative improvement: keep looking for productive changes.
But there is a 30 s limit to return a solution, for every problem, regardless of city size and number of intersections.
So how many starting points? How many iterative perturbations?
Solution: check how much time has passed, and end optimization just before the time limit.
Time Limit

#include <chrono>    // time utilities; everything lives inside namespace chrono
using namespace std;

#define TIME_LIMIT 30    // m4: 30 second time limit

int main() {
   // now() is a static member function: can be called without an object
   auto startTime = chrono::high_resolution_clock::now();
   bool timeOut = false;
   while (!timeOut) {
      myOptimizer();    // one round of your optimization
      auto currentTime = chrono::high_resolution_clock::now();
      // Time difference in seconds; gives the actual elapsed (wall-clock) time
      // no matter how many CPUs you are using
      auto wallClock = chrono::duration_cast<chrono::duration<double>>(currentTime - startTime);
      // Keep optimizing until within 10% of the time limit
      if (wallClock.count() > 0.9 * TIME_LIMIT)
         timeOut = true;
   }
   ...
}
Multithreading: Why & How
Intel 8086
   First PC microprocessor (1978)
   29,000 transistors
   5 MHz clock
   ~10 clocks / instruction, so ~500,000 instructions / s
Intel Core i7 "Skylake"
   2015
   1.5 billion transistors
   4.0 GHz clock
   ~15 clocks / instruction, but ~30 instructions in flight
   Average ~2 instructions completed / clock, so ~8 billion instructions / s
   (die photo: courtesy techpowerup.com)
CPU Scaling: Past & Future
1978 to 2015: 50,000x more transistors, 16,000x more instructions / s.
The future:
   Still get ~2x transistors every ~2 years, but transistors are not getting much faster: CPU clock speed is saturating.
   ~30 instructions in flight; the complexity and power needed to go beyond this climb rapidly, so instructions / cycle grows slowly.
Impact:
   CPU speed is no longer increasing rapidly, but many processors (cores) fit on a single chip.
   Using multiple cores is now important.
   Multithreading: one program using multiple cores at once.
A Single-Threaded Program
[Diagram: one CPU / core (program counter, stack pointer) and one memory space holding the instructions (code), global variables, heap variables (new), and the stack (local variables).]
A Multi-Threaded Program
[Diagram: the instructions (code), global variables, and heap variables (new) are shared by all threads; each thread (core 1 / thread 1, core 2 / thread 2) has its own program counter, stack pointer, and stack of local variables.]
Thread Basics
Each thread has its own program counter:
   Can be executing a different function.
   Is (almost always) executing a different instruction from the other threads.
Each thread has its own stack:
   Has its own copy of local variables (all different).
Each thread sees the same global variables.
Dynamically allocated memory is shared by all threads: any thread with a pointer to it can access it.
Implications
Threads can communicate through memory:
   Global variables
   Dynamically allocated memory
   Fast communication!
Must be careful that threads don't conflict in reads/writes to the same memory.
   What if two threads update the same global variable at the same time? Not clear which update wins! (See the small example below.)
Can have more threads than CPUs: the threads time-share the CPUs.
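As a tiny illustration of the "which update wins?" problem, the sketch below increments one global from many threads with no coordination (the counter name and loop count are arbitrary). Compiled with OpenMP support (e.g. -fopenmp), the final value typically comes out below 1,000,000 and changes from run to run.

#include <iostream>

int shared_counter = 0;    // global: visible to every thread

int main() {
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        shared_counter++;    // read-modify-write on shared data: a data race,
    }                        // so some increments are silently lost
    std::cout << "shared_counter = " << shared_counter << std::endl;
    return 0;
}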
Multi-Threading Libraries
Program start: one "main" thread is created. Need more? Create them with an API / library. Options:

1. OpenMP
   Compiler directives: higher-level code.
   Will compile and run serially if the compiler doesn't support OpenMP.
      #pragma omp parallel for
      for (int i = 0; i < 10; i++) {

2. C++11 threads
   More control, but lower-level code (a complete minimal version is sketched below).
      #include <thread>    // C++11 feature
      int main() {
         thread myThread(start_func_for_thread);

3. pthreads
   Same as C++11 threads, but nastier syntax.
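Since the slide only shows a fragment of option 2, here is a minimal complete version under the same placeholder name (start_func_for_thread); the printed message and thread argument are illustrative.

#include <iostream>
#include <thread>    // C++11 feature

void start_func_for_thread(int id) {
    std::cout << "hello from thread " << id << std::endl;
}

int main() {
    std::thread myThread(start_func_for_thread, 7);   // runs concurrently with main
    // ... main can do other work here ...
    myThread.join();    // wait for the thread to finish before main exits
    return 0;
}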
Convert to Use Multiple Threads

#include <iostream>
#include <vector>
using namespace std;

int main() {
   vector<int> a(100), b(100);
   ...
   for (int i = 0; i < a.size(); i++) {
      a[i] = a[i] + b[i];
      cout << "a[" << i << "] is " << a[i] << endl;
   }
}
Convert to Use Multiple Threads

#include <iostream>
#include <vector>
using namespace std;

int main() {
   vector<int> a(100, 50), b(100, 2);   // these variables are shared by all threads
   ...
   #pragma omp parallel for
   for (int i = 0; i < a.size(); i++) {
      // local variables declared within the block are separate per thread
      a[i] = a[i] + b[i];
      cout << "a[" << i << "] is " << a[i] << endl;
   }
}
Output?
a[0] is 52 a[1] is 52 a[2] is a[25] is 52 a[26] is 52 52 a[27] a[50] is 52 is 52 ...
(garbled: the exact interleaving varies from run to run)

Why? std::cout is a global variable being shared by multiple threads, and it is not "thread safe". Each << is a function call to operator<<, so output from different threads randomly interleaves.
What Did OpenMP Do?
It split the loop iterations across threads (4 to 8 on the UG machines):

// thread 1
for (int i = 0; i < 25; i++) {
   cout << a[i] << endl;
}

// thread 2
for (int i = 25; i < 50; i++) {
   cout << a[i] << endl;
}
// ... and so on for the remaining threads
Fixing the Output Problem

int main() {
   vector<int> a(100, 50), b(100, 2);
   ...
   #pragma omp parallel for
   for (int i = 0; i < a.size(); i++) {
      #pragma omp critical    // only one thread at a time can execute this block
      cout << "a[" << i << "] is " << a[i] << endl;
   }
}

Make everything critical? That destroys your performance (you are serial again).
M4: What to Multithread?
Want:
   Something that takes much (ideally the majority of) the CPU time.
   Work that is independent: no or few dependencies.
Candidates?
   Path finding: 2N invocations of multi-destination Dijkstra, all doing independent work; the biggest CPU time for most teams.
   Multi-start of the order optimization algorithm: independent work; large CPU time if the algorithm is complex (see the sketch below).
   Multiple perturbations to the order at once? Harder: needs a critical section to update the best solution; maybe update only periodically.
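One way the multi-start candidate could be parallelized is sketched below, assuming an optimize_order_from_seed(seed) routine (a made-up name) that runs one complete optimization and returns the travel time it achieved; only the comparison against the shared best needs a critical section.

#include <iostream>
#include <limits>
using namespace std;

// Assumed helper (declaration only): run the order optimizer once from the
// given random seed and return the best travel time that run found.
double optimize_order_from_seed(int seed);

int main() {
    double best_time = numeric_limits<double>::infinity();
    #pragma omp parallel for
    for (int seed = 0; seed < 32; seed++) {
        double t = optimize_order_from_seed(seed);   // independent work per thread
        #pragma omp critical                         // serialize only the best-solution update
        {
            if (t < best_time)
                best_time = t;
        }
    }
    cout << "best travel time: " << best_time << endl;
    return 0;
}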
How Much Speedup Can I Get?
Path order optimization: 40% of the time. Pathfinding: 60% of the time.
Make pathfinding parallel on 4 cores? Its time drops by at most 4x:
   Total time = 0.4 + 0.6 / 4 = 0.55 of serial, i.e. about 1.8x faster.
Limited by the serial code: Amdahl's law.
Can run 8 threads with reduced per-thread performance (hyperthreading): total time is about 0.51 of serial in this case.
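For reference, the general form of Amdahl's law with parallel fraction p on n cores, with the numbers above plugged in:

   speedup(n) = 1 / ((1 - p) + p / n)
   speedup(4) = 1 / (0.4 + 0.6 / 4) = 1 / 0.55 ≈ 1.8x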
Gotcha? Writing to a global variable from multiple threads?

vector<int> reaching_edge;    // for all intersections: how did I reach you?

#pragma omp parallel for
for (int i = 0; i < deliveries.size(); i++) {
   path_times = multi_dest_dijkstra(i, deliveries);
}

Here every thread writes to the global reaching_edge (and to path_times). Refactor the code to make the data local (best); use omp critical only if truly necessary.
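One way to apply the "make it local" advice, as a sketch: assume multi_dest_dijkstra can be written so that all of its bookkeeping (including any "how did I reach you?" data) lives in local variables and its result is returned. Each loop iteration then writes only its own slot of a pre-sized results vector, so no critical section is needed; the exact signature is an assumption.

#include <vector>
using namespace std;

// Assumed (declaration only): all scratch data is local inside this function,
// so concurrent calls from different threads do not share any writable state.
vector<double> multi_dest_dijkstra(int i, const vector<int>& deliveries);

void compute_all_path_times(const vector<int>& deliveries,
                            vector<vector<double>>& path_times) {
    path_times.resize(deliveries.size());    // size the container before the parallel loop
    #pragma omp parallel for
    for (int i = 0; i < (int)deliveries.size(); i++) {
        // each iteration writes a different element, so threads never
        // write to the same memory: no race, no omp critical needed
        path_times[i] = multi_dest_dijkstra(i, deliveries);
    }
}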
Keep It Simple!
Do not try multi-threading until you have a good serial implementation!
   It is much harder to write and debug multi-threaded code.
   You do not need to multi-thread to have a good implementation.
Resources:
   OpenMP Tutorial
   Debugging threads with gdb